Skip to content

feat(types): UUID Arrow type migration — pa.binary(16) for all UUID storage (PLT-1162)#161

Merged
eywalker merged 27 commits into
mainfrom
eywalker/plt-1162-design-spike-uuid-arrow-type-mapping-large_string-vs
Jun 13, 2026
Merged

feat(types): UUID Arrow type migration — pa.binary(16) for all UUID storage (PLT-1162)#161
eywalker merged 27 commits into
mainfrom
eywalker/plt-1162-design-spike-uuid-arrow-type-mapping-large_string-vs

Conversation

@kurodo3

@kurodo3 kurodo3 Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Summary

Replace pa.large_string() UUID storage with pa.binary(16) (fixed_size_binary[16]) across the entire OrcaPod codebase, introducing a semantic struct converter and updating all UUID generation and storage sites.

Changes

Area What changed
types.py No new constants added — each call site uses pa.binary(16) directly. Adding module-level lazy constants would trigger PyArrow import at types import time, defeating the LazyModule pattern already used there.
semantic_struct_converters.py UUIDStructConverter with full SemanticStructConverterProtocol compliance
versioned_hashers.py UUIDStructConverter registered under "uuid" in the default semantic type registry
builtin_handlers.py UUIDHandler.handle() returns obj.bytes (16 raw bytes) instead of str(obj)
arrow_utils.py System record_id tag column type → pa.binary(16); cascading fixes to universal_converter.py, join.py, merge_join.py, stream_builder.py, cached_source.py
datagram.py, data_function.py datagram_id now returns uuid.UUID (normalized from uuid_utils.uuid7); source-info UUID segment uses .hex
logging_observer.py, status_observer.py uuid7().bytes (raw 16 bytes) instead of str(uuid7()) for _log_id / _status_id Arrow columns
postgresql_connector.py uuid PG type → pa.binary(16); _coerce_pg_value() converts driver uuid.UUID objects to .bytes; TODO comment removed
databases/utils.py coerce_record_id() helper: transparently encodes str record IDs to bytes at DB method boundaries
superpowers/specs/, superpowers/plans/ Design spec and implementation plan

Test plan

  • 3616 tests pass, 17 skipped (live DB integration), 0 failures
  • Full UUIDStructConverter suite (19 tests) including Arrow array round-trip, bytearray handling, protocol method coverage
  • UUIDHandler returns bytes test
  • PostgreSQL type mapping unit test (pa.binary(16) for uuid)
  • Semantic registry registration verified end-to-end

Notes

Breaking changes (pre-v0.1.0, intentional per spec):

  • datagram_id is now uuid.UUID (returned by Datagram.datagram_id, Tag.datagram_id, Data.datagram_id). Any code assuming it was str will break.
  • record_id (DB pipeline entry IDs), _log_id, _status_id are now bytes (16 bytes). Any code assuming these were str will break.

Polars write_json() limitation: pola-rs/polars#15410 — OrcaPod has zero write_json() calls, unaffected in practice.

SQLite: No change required — TEXT columns remain pa.large_string(). Arrow type round-trip fidelity for loosely-typed backends is tracked in PLT-1615.

run_id: Intentionally unchanged — it's a transient Python string passed to observer callbacks, never stored in Arrow.

Fixes PLT-1162

🤖 Generated with Claude Code

@codecov

codecov Bot commented Jun 11, 2026

Copy link
Copy Markdown

kurodo3 Bot and others added 14 commits June 11, 2026 23:45
Formalises pa.binary(16) as the canonical Arrow type for all UUID values
in OrcaPod — both system identifiers (UUID_ARROW_TYPE) and user-facing
semantic round-trips (UUID_STRUCT_ARROW_TYPE). Documents the two-layer
design, no-truncation rule, known Polars write_json limitation, and full
implementation scope.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… (PLT-1162)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…y via module __getattr__ (PLT-1162)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…uct round-trips (PLT-1162)

Implements UUIDStructConverter in semantic_struct_converters.py following
the existing PythonPathStructConverter/UPathStructConverter pattern.
Stores UUIDs as fixed 16-byte binary in a single-field Arrow struct.
Duck-typed to also accept uuid_utils.UUID objects (uuid7 etc.) which do
not inherit from uuid.UUID but expose .bytes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ter (PLT-1162)

Adds `can_handle_python_type`, `can_handle_struct_type`, and `hash_struct_dict`
to `UUIDStructConverter` so it fully satisfies `SemanticStructConverterProtocol`.
Also adds 7 new tests covering the new methods.
…c registry (PLT-1162)

Register UUIDStructConverter in get_versioned_semantic_arrow_hasher so that
uuid.UUID values are handled by the default semantic type registry. Add three
tests verifying the registration by python type, struct signature, and semantic
type name.
…th binary Arrow storage (PLT-1162)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…LT-1162)

- Change record_id system tag column type from pa.large_string() to
  UUID_ARROW_TYPE (pa.binary(16)) in arrow_utils.add_system_tag_columns
- Import UUID_ARROW_TYPE from orcapod.types in arrow_utils.py
- Update stream_builder.py to generate deterministic UUID v5 bytes for
  record_id values (derived from source_id + provenance token), replacing
  the previous string-based provenance tokens passed to the system tag column
- Add fixed_size_binary support to UniversalTypeConverter.arrow_type_to_python_type
  so UUID_ARROW_TYPE columns map to Python bytes
- Fix _predict_system_tag_schema in Join and MergeJoin operators to carry
  through the actual column type (bytes for record_id) rather than hardcoding str
- Fix CachedSource to exclude random system-tag record_id columns from the
  row hash used for deduplication (record_ids are now deterministic UUIDs
  per source, so dedup is stable)
- Update tests that passed string record_ids to pass UUID bytes instead

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…D binary migration (PLT-1162)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- datagram.py: datagram_id property returns bytes (uuid7().bytes), update
  record_id parameter and internal method signatures from str|None to bytes|None
- data_function.py: generate uuid7() once, use .bytes for record_id Arrow storage
  and .hex for source_info provenance strings (source_info stays large_string)
- logging_observer.py: log_id uses uuid7().bytes with pa.binary(16) column type
- status_observer.py: status_id uses uuid7().bytes with pa.binary(16) column type
  in both _ContextualizedStatusObserver and _ScopedStatusObserver
- Cascade updates to maintain consistency:
  - protocols/core_protocols/datagrams.py: datagram_id protocol return type bytes
  - tag_data.py: record_id parameter annotations bytes|None
  - function_node.py: data_record_id bytes, DATA_RECORD_ID stored as pa.binary(16)
  - databases: _ensure_record_id_column, add_record accept str|bytes in all impls
  - database_protocols.py: add_record protocol updated to str|bytes
- Fix two tests broken by the type change: source_info UUID format (dashes→hex)
  and data_record_id column type (str→bytes with len==16)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nary(16)) (PLT-1162)

- Move uuid out of the grouped string branch in _pg_type_to_arrow so it
  returns UUID_ARROW_TYPE = pa.binary(16) instead of pa.large_string()
- Add _coerce_pg_value helper to convert psycopg uuid.UUID objects to
  bytes before Arrow array construction in iter_batches
- Update test_uuid in TestPgTypeToArrow to assert pa.binary(16)
- Import _coerce_pg_value in test file for completeness
@kurodo3 kurodo3 Bot force-pushed the eywalker/plt-1162-design-spike-uuid-arrow-type-mapping-large_string-vs branch from 414d87e to 0c922b7 Compare June 11, 2026 23:47

@eywalker eywalker left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewing through, I don't think that the use of UUID TYPE constant is actually helping with anything -- let's just make that fixed to be defined as binary type and drop use of constants as we as the getattr trick.

c for c in live_table.column_names
if c.startswith(constants.SYSTEM_TAG_RECORD_ID_PREFIX)
]
hash_table = (

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was this some kind of a bug that was present before?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this was a latent pre-existing bug exposed by the migration.

Before this PR, was str(uuid7()) — a fresh random string on every run. Since the dedup hash included the record_id column and that value changed every invocation, the cached-source dedup never actually recognised repeat rows as duplicates. It was silently broken.

The migration made it visible: deterministic UUID v5 record_ids meant the hash would be stable across runs, which is correct — but we also don't need to include the record_id in the hash at all, since it's derived from the source_id + provenance token already captured in the hash. So the fix excludes record_id tag columns from the dedup hash (removing redundant information) while making the dedup work correctly for the first time.


# Namespace UUID for source record IDs — all source record IDs are derived
# from this namespace and the row's source_id + provenance token.
_SOURCE_RECORD_ID_NAMESPACE = uuid.UUID("7f8e9d0c-1b2a-3c4d-5e6f-7a8b9c0d1e2f")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is this value chosen?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was arbitrary — chosen once by the implementer with no principled derivation, which is why it looked opaque.

Fixed: replaced with a UUID v5 derived from uuid.NAMESPACE_URL + a stable OrcaPod-specific URL string, so the value is reproducible from its inputs rather than being a magic hex literal:

_SOURCE_RECORD_ID_NAMESPACE = uuid.uuid5(
    uuid.NAMESPACE_URL,
    "https://orcapod.io/namespaces/source-record-id",
)

Added a doc comment noting this value must never change once deployed, since changing it would invalidate all existing stored record IDs.


@property
def datagram_id(self) -> str:
def datagram_id(self) -> bytes:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't this be cleaner to return an actual UUID object instead?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed — done. datagram_id now returns uuid.UUID throughout the protocol, Datagram, and Tag/Data. Generation uses uuid7() normalised to stdlib uuid.UUID (for cross-library equality). Arrow array construction calls .bytes at the point of actual serialisation, keeping the Python interface ergonomic.

Comment thread src/orcapod/__init__.py Outdated


def __getattr__(name: str) -> object:
"""Lazy resolution for module-level constants that depend on pyarrow.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which constant would have triggered the invocation of pyarrow and why?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accessing either constant triggered it. The __getattr__ in types.py called import pyarrow as _pa to construct the value on first access. So any from orcapod.types import UUID_ARROW_TYPE or from orcapod import UUID_ARROW_TYPE call would load the full pyarrow C-extension immediately — which made the __init__.py eager import of the constant just as non-lazy. The __init__.py __getattr__ was needed to defer even that.

This whole mechanism is removed now per the overall review direction: the constants are gone, both __getattr__ functions are removed, and pa.binary(16) is used inline at each site.

assert len(values) == result_table.num_rows
for val in values:
assert isinstance(val, str)
# source_id columns are large_string (str), record_id columns are

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perform a more thorough test based on the target column

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. The assertion now checks per column: record_id columns (matching SYSTEM_TAG_RECORD_ID_PREFIX) assert isinstance(val, bytes) and len(val) == 16; all other system tag columns assert isinstance(val, str). This catches any accidental type regression in either direction.

…atagram_id, per-column tests (PLT-1162)

- Remove UUID_ARROW_TYPE and UUID_STRUCT_ARROW_TYPE constants and their lazy __getattr__ resolvers
  from orcapod.types and orcapod.__init__; use pa.binary(16) / pa.struct([...]) inline everywhere
- Replace _SOURCE_RECORD_ID_NAMESPACE opaque hex literal with a principled uuid.uuid5 derivation
  with doc comment explaining it must never change
- Change datagram_id property return type from bytes to uuid.UUID throughout
  (protocol, Datagram, Tag, Data); generation via uuid7() normalised to stdlib uuid.UUID
  to ensure cross-library equality; call sites that build pa.binary(16) arrays use .bytes
- Strengthen system-tag column type assertion in test_operators to per-column checks:
  record_id columns asserted as bytes+len==16, source_id columns asserted as str

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@kurodo3

kurodo3 Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

Review round 1 — addressed in commit 13a1d5a

All comments addressed. Summary:

Drop UUID_ARROW_TYPE / UUID_STRUCT_ARROW_TYPE constants + __getattr__ trick (overall review + question on __init__.py)
Removed both named constants and both __getattr__ functions (types.py and __init__.py). Every site now uses pa.binary(16) or pa.struct([pa.field("uuid", pa.binary(16))]) inline. The __getattr__ mechanism was needed because importing either constant immediately triggered pyarrow loading — that whole layer is gone.

cached_source.py dedup hash — yes, pre-existing latent bug
The old code included the record_id column in the dedup hash. Since record_id was str(uuid7()) (fresh random on every run), the hash was always different, so cached-source dedup was silently never matching. The migration made this visible. The fix excludes record_id columns from the hash (they're derived from data already in the hash) and makes the dedup work correctly for the first time.

_SOURCE_RECORD_ID_NAMESPACE — now principled
Replaced the opaque hex literal with uuid.uuid5(uuid.NAMESPACE_URL, "https://orcapod.io/namespaces/source-record-id") plus a doc comment that this value must never change once deployed.

datagram_id returns uuid.UUID
Accepted. Protocol, Datagram, Tag, Data all return uuid.UUID. Arrow array construction calls .bytes at the serialisation point.

Per-column system-tag type assertions in test_operators.py
Replaced the weak isinstance(val, (str, bytes)) check with per-column assertions: record_id columns → bytes + len == 16; all others → str.

Comment thread src/orcapod/core/datagrams/datagram.py Outdated
"""Return (or lazily generate) the datagram's unique ID."""
if self._datagram_id is None:
self._datagram_id = str(uuid7())
# uuid7() returns a uuid_utils.UUID; normalise to stdlib uuid.UUID

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a good reason to mix uuid_utils and stdlib for handling uuid? Does uuid_utils offer richer set of functions for uuid generation/handling?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed — is only here for UUIDv7 generation (time-ordered, monotonic), which isn't in the stdlib before Python 3.12. However and are distinct types that don't compare equal even for identical bit patterns, so we normalise immediately to on creation. The comment has been expanded to make this explicit.

f"{stream.pipeline_hash().to_hex(n_char)}:{idx}"
)
system_tag_fields[new_name] = str
system_tag_fields[new_name] = col_type

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should include tests to ensure that the removal force convresion into str is not creating a new error?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Added in . It iterates over all system-tag columns in the predicted schema after a two-way join and asserts that record_id columns map to (from ) and source_id columns map to — verifying that is carried through unchanged.

)

# Compute per-row record hashes for dedup: hash(full row)
# Compute per-row record hashes for dedup: hash(full row excluding

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I acutally don't understand the logic here. For system tag, the source_id is the unique specifier of the nature/identity of the source and the record_id is a unique specifier about the origin of the specific record. If we are deeming record_id to be truly redundant with record_hash, then we should use record_id as the record_hash. Shouldn't it be the case that we'd actually include everything that's part of the identity of the row as part of the record hash so that in case two entries with everything identical BUT the record_id, suggesting that they originated from two distinct entries in the source that happens to carry identical values, we can correctly distinguish them? Otherwise, of course the inclusion of the rest of the fields are important to detect situation where same record_id entries end up having distinct values but we still want to track it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed — I reverted the exclusion. The record_id system tag encodes row origin (source identity + provenance token), so two rows that are byte-identical in their data columns but originate from different positions in the source will have distinct record_id values and must be stored as separate entries. Hashing the full live_table (including the record_id column) is correct. The dedup logic now hashes all system tag columns.

# record IDs.
_SOURCE_RECORD_ID_NAMESPACE = uuid.uuid5(
uuid.NAMESPACE_URL,
"https://orcapod.io/namespaces/source-record-id",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this typically set to? Also be sure to use Orcapod instead of OrcaPod throughout the codebase. If this is meant to be a namespace URL that we would maintain, make it https://orcapod.org/namespaces/source-record-id

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed:

  • URL corrected to https://orcapod.org/namespaces/source-record-id (was http://orcapod.ai/...)
  • Capitalisation corrected to "Orcapod"
  • Added the pre-computed value as a comment: uuid.UUID('877ec89b-6645-5852-ba37-a94604043f5e')

return f"row_{row_index}"


def _make_record_id_bytes(source_id: str, provenance_token: str) -> bytes:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's instead return uuid.UUID for consistency?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — _make_record_id_bytes renamed to _make_record_id and now returns a uuid.UUID rather than raw bytes. The call site does .bytes to get the 16-byte value for Arrow storage. This gives callers a real UUID object (with full UUID API surface) instead of an opaque byte string.

self._scoped_path = _scoped_path
self._pending_batches: dict[str, pa.Table] = _shared_pending_batches if _shared_pending_batches is not None else {}
self._pending_record_ids: dict[str, set[str]] = _shared_pending_record_ids if _shared_pending_record_ids is not None else defaultdict(set)
self._pending_record_ids: dict[str, set[str | bytes]] = _shared_pending_record_ids if _shared_pending_record_ids is not None else defaultdict(set)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we still have to take in str?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped str entirely. compute_pipeline_entry_id now returns bytes (a method-prefixed digest via ContentHash.to_prefixed_digest()), and UUID-based record IDs are stored as 16-byte raw values — both are bytes. There is no longer a caller path that passes a str record ID, so the str union arm was just dead code.

self, arrow_data: pa.Table, record_id: str | bytes
) -> pa.Table:
if self.RECORD_ID_COLUMN not in arrow_data.column_names:
arrow_type = pa.binary(16) if isinstance(record_id, bytes) else pa.large_string()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this necessary? What currently can yield incompatible str record_id?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleaned up — record_id is now bytes only and _ensure_record_id_column always creates a pa.large_binary() column (no conditional branch on type). pa.large_binary() is used uniformly for all database record ID columns: it accommodates both fixed-size UUID bytes (16 bytes) and variable-length method-prefixed pipeline entry digests without requiring a schema change if sizes shift.

if self.RECORD_ID_COLUMN not in arrow_data.column_names:
# Add record_id column at the beginning
key_array = pa.array([record_id] * len(arrow_data), type=pa.large_string())
arrow_type = pa.binary(16) if isinstance(record_id, bytes) else pa.large_string()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should drop support for large_string

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped large_string support entirely. _ensure_record_id_column now always creates a pa.large_binary() column unconditionally, and all str | bytes / str type annotations in the Delta Lake DB layer have been updated to bytes only — matching the protocol and the other database implementations.

…o_prefixed_digest()

Addresses second round of PR review comments on the UUID Arrow type migration.

Key changes:
- Add `ContentHash.to_prefixed_digest()` → `b"{method}:{digest}"` for
  human-readable method-prefixed binary record IDs
- `compute_pipeline_entry_id()` now returns `bytes` via `to_prefixed_digest()`;
  update all `dict[str, ...]` caches in FunctionNode to `dict[bytes, ...]`
- Migrate all database record-ID storage from `pa.binary(16)` → `pa.large_binary()`
  so UUID-based and hash-based IDs share a single consistent Arrow type
- Drop `str | bytes` unions from `ArrowDatabaseProtocol`, all DB implementations,
  and `_pending_record_ids`; record IDs are now uniformly `bytes`
- `_make_record_id_bytes` → `_make_record_id` returning `uuid.UUID` (caller calls `.bytes`)
- Namespace URL fixed: `orcapod.io` → `orcapod.org`, "Orcapod" capitalisation,
  computed UUID value added as comment
- Revert C3: include all system tag columns (including record_id) in dedup hash
  so rows with identical data values but different provenance are not collapsed
- Add test for `col_type` preservation through `_predict_system_tag_schema`
- Improve datagram.py comment explaining uuid_utils vs stdlib normalisation
- Update all DB and FunctionNode tests to use `bytes` record IDs

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@kurodo3

kurodo3 Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

Round 2 review — summary of changes

All review comments addressed in commit fix(uuid): migrate record IDs to pa.large_binary(); add ContentHash.to_prefixed_digest().

C1 — uuid_utils vs stdlib UUID (datagram.py)

Expanded the comment to explain both why uuid_utils is used (UUIDv7 generation, not yet in stdlib before 3.12) and why we normalise to uuid.UUID immediately (cross-library type inequality even for identical bit patterns).

C2 — col_type propagation through _predict_system_tag_schema (join.py)

Added test_output_schema_preserves_system_tag_col_types to TestJoinOutputSchemaSystemTags. The test verifies that record_id system-tag columns carry bytes and source_id columns carry str through output_schema().

C3 — record_id in dedup hash (cached_source.py)

Reverted the exclusion. record_id encodes row origin (source identity + provenance token), so two identical-data rows from different source positions must produce distinct hash entries. The dedup loop now hashes the full live_table including all system tag columns.

C4 — Namespace URL / capitalisation (stream_builder.py)

  • URL corrected to https://orcapod.org/namespaces/source-record-id
  • Capitalisation corrected to "Orcapod"
  • Pre-computed UUID value added as a comment: uuid.UUID('877ec89b-6645-5852-ba37-a94604043f5e')

C5 — _make_record_id return type (stream_builder.py)

Renamed _make_record_id_bytes_make_record_id; function now returns uuid.UUID. Call site uses .bytes for Arrow storage.

C6 / C7 — Drop str from connector_arrow_database.py

record_id is now bytes only throughout. _ensure_record_id_column always creates pa.large_binary() unconditionally. No caller passes str record IDs — the union was dead code.

C8 — Drop large_string from delta_lake_databases.py

Same change applied to the Delta Lake DB layer: _ensure_record_id_column always uses pa.large_binary(), and all str | bytes / str type annotations updated to bytes.

Broader scope (discussed and agreed during review)

  • pa.large_binary() for all database record ID columns — replaces pa.binary(16) in arrow_utils.py. Accommodates both fixed-size UUID bytes and variable-length method-prefixed pipeline entry digests without a future schema migration.
  • ContentHash.to_prefixed_digest() — new method returning b"{method}:{digest}" bytes. Used by compute_pipeline_entry_id(), which now returns bytes instead of a hex string.
  • Protocol and all DB implementationsadd_record, get_record_by_id, get_records_by_ids all updated to bytes-only signatures across database_protocols.py, noop_database.py, in_memory_databases.py, connector_arrow_database.py, delta_lake_databases.py.
  • All tests passing (3616 tests, 56 skipped).

kurodo3 Bot and others added 4 commits June 12, 2026 01:53
…ID mapping

- test-objective/unit/test_databases.py: update all string record IDs to bytes
  literals (b"...") to match the bytes-only record_id API introduced in the
  large_binary migration. String IDs were silently encoded by PyArrow but then
  not found during set-membership lookup (str != bytes).

- postgresql_connector.py: map PostgreSQL uuid type to pa.large_string() instead
  of pa.binary(16), completing the TODO: PLT-1162 marker in the integration test.
  psycopg returns uuid.UUID objects; the value coercion now calls str(value)
  to produce the canonical "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" form.

- tests/test_databases/test_postgresql_connector.py: update test_uuid assertion
  to expect pa.large_string().
- tests/test_databases/test_postgresql_connector_integration.py: remove the
  # TODO: PLT-1162 comment now that the assertion is fulfilled.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add a shared ``coerce_record_id(str | bytes) -> bytes`` helper in
``databases/_utils.py`` and apply it at the entry of every public method
that accepts a ``record_id`` parameter:

- ``add_record`` / ``get_record_by_id`` / ``get_records_by_ids`` in
  ``InMemoryArrowDatabase``, ``ConnectorArrowDatabase``,
  ``DeltaTableDatabase``, and ``NoOpArrowDatabase``
- ``_ensure_record_id_column`` and ``_create_record_id_filter`` in
  ``DeltaTableDatabase``

The protocol signatures are updated to ``str | bytes`` to reflect that both
are accepted.  Internally all paths remain ``bytes`` after coercion; the
Arrow storage type stays ``pa.large_binary()``.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…l only

The str → bytes coercion is a convenience feature of the concrete database
implementations, not part of the ArrowDatabaseProtocol contract.  The
protocol signatures stay bytes-only so that callers coding to the protocol
get a clear type contract; the leniency lives in the implementing classes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
No leading underscore; the file is not private to the package,
just a shared utility module.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread src/orcapod/core/data_function.py Outdated
import logging
import re
import sys
import uuid as _uuid_module

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the need to name this _uuid_module?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — removed the alias. Changed to plain import uuid and updated the usage to uuid.UUID(bytes=uuid7().bytes).

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Migrates OrcaPod’s UUID handling from string-based Arrow storage toward binary representations, updating semantic type conversion, hashing/caching IDs, and database record ID APIs to work with bytes and fixed-size binary types.

Changes:

  • Introduces a UUIDStructConverter and registers it in the default semantic registry; updates UUIDHandler to hash UUIDs via raw bytes.
  • Switches many internal identifiers (system tags, pipeline entry IDs, observers, DB record IDs) from str to bytes and updates tests accordingly.
  • Expands type conversion support for fixed_size_binarybytes, and updates multiple DB implementations/connectors to store record IDs in binary columns.

Reviewed changes

Copilot reviewed 46 out of 46 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
tests/test_utils/test_arrow_utils.py Updates system tag record_id test input to use UUID bytes.
tests/test_semantic_types/test_uuid_struct_converter.py Adds full unit coverage for UUIDStructConverter.
tests/test_semantic_types/test_semantic_registry.py Verifies UUID converter registration and reverse lookup.
tests/test_hashing/test_uuid_handler.py Adds tests asserting UUID handler returns 16-byte form.
tests/test_databases/test_postgresql_connector.py Updates tests around PG type mapping/coercion (UUID mapping still asserted as string).
tests/test_databases/test_postgresql_connector_integration.py Updates integration expectations (UUID mapping still asserted as string).
tests/test_databases/test_noop_database.py Updates record_id arguments to bytes.
tests/test_databases/test_in_memory_database.py Updates record_id arguments/expectations to bytes.
tests/test_databases/test_delta_table_database.py Updates record_id arguments/expectations to bytes.
tests/test_databases/test_delta_table_database_s3.py Updates record_id arguments/expectations to bytes.
tests/test_databases/test_connector_arrow_database.py Updates record_id arguments/expectations to bytes.
tests/test_core/operators/test_operators.py Adds assertions that system tag schemas preserve bytes vs str types.
tests/test_core/nodes/test_function_node_iteration.py Updates cached entry-id key type to bytes.
tests/test_core/function_pod/test_function_pod_node.py Updates data record-id assertions to 16-byte values.
tests/test_core/function_pod/test_function_node_caching.py Updates pipeline entry-id type assertions to bytes.
tests/test_core/data_function/test_data_function.py Adjusts source-info UUID segment expectations to hex form.
test-objective/unit/test_databases.py Updates objective tests to use bytes record IDs and binary Arrow arrays.
test-objective/unit/test_arrow_data_utils.py Updates system-tag tests to pass UUID bytes as record IDs.
superpowers/specs/2026-06-11-plt-1162-uuid-arrow-type-design.md Adds design spec for UUID Arrow type migration.
superpowers/plans/2026-06-11-plt-1162-uuid-arrow-type.md Adds implementation plan/checklist for migration steps.
src/orcapod/utils/arrow_utils.py Changes system tag record_ids typing to bytes; updates record_id column storage type.
src/orcapod/types.py Adds method-prefixed raw digest bytes API (to_prefixed_digest).
src/orcapod/semantic_types/universal_converter.py Maps fixed_size_binary Arrow types to Python bytes.
src/orcapod/semantic_types/semantic_struct_converters.py Adds UUIDStructConverter implementation.
src/orcapod/protocols/database_protocols.py Updates Arrow database protocol record_id types to bytes.
src/orcapod/protocols/core_protocols/datagrams.py Changes datagram_id protocol type to uuid.UUID.
src/orcapod/pipeline/status_observer.py Stores _status_id as binary(16) UUID bytes.
src/orcapod/pipeline/logging_observer.py Stores _log_id as binary(16) UUID bytes.
src/orcapod/hashing/versioned_hashers.py Registers UUIDStructConverter in default semantic registry.
src/orcapod/hashing/semantic_hashing/builtin_handlers.py Updates UUIDHandler to return UUID bytes.
src/orcapod/databases/utils.py Adds helper to coerce record_id from str→bytes.
src/orcapod/databases/postgresql_connector.py Adds value coercion helper; updates batch conversion pipeline; UUID mapping remains string.
src/orcapod/databases/noop_database.py Broadens record_id parameter types to accept bytes.
src/orcapod/databases/in_memory_databases.py Stores record IDs in binary Arrow columns; coerces input IDs to bytes.
src/orcapod/databases/delta_lake_databases.py Stores record IDs in binary Arrow columns; coerces input IDs to bytes.
src/orcapod/databases/connector_arrow_database.py Stores record IDs in binary Arrow columns; coerces input IDs to bytes.
src/orcapod/core/sources/stream_builder.py Makes deterministic per-row UUID record IDs and stores bytes in system tags.
src/orcapod/core/sources/cached_source.py Updates dedup hashing comment to reflect system-tag semantics.
src/orcapod/core/result_cache.py Reads/writes DB record IDs as UUID bytes; converts back to uuid.UUID on lookup.
src/orcapod/core/operators/merge_join.py Preserves original system-tag Python types (bytes vs str) in predicted schema.
src/orcapod/core/operators/join.py Preserves original system-tag Python types; updates sort key defaults for bytes record IDs.
src/orcapod/core/nodes/function_node.py Switches pipeline entry IDs and caches from str→bytes using to_prefixed_digest().
src/orcapod/core/datagrams/tag_data.py Changes datagram record_id parameter type to `uuid.UUID
src/orcapod/core/datagrams/datagram.py Normalizes UUIDv7 IDs to stdlib uuid.UUID on generation.
src/orcapod/core/data_function.py Stores record_id as uuid.UUID and source_info embeds UUID hex.
src/orcapod/init.py Minor formatting-only change.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 1056 to 1058
source_id_array = pa.array(source_ids, type=pa.large_string())
record_id_array = pa.array(record_ids, type=pa.large_string())
record_id_array = pa.array(record_ids, type=pa.large_binary())

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — changed add_system_tag_columns() to use pa.binary(16) for the record_id column. Record IDs in system tags are always UUID v5 bytes (exactly 16 bytes), so pa.binary(16) is the precise type. The docstring and the implementation now agree.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in commit e0bbc37. add_system_tag_columns() now uses pa.binary(16) for the record_id column — both the field declaration and the pa.array() call — matching the docstring.

Comment on lines 1064 to 1066
table = table.append_column(
pa.field(record_id_col_name, pa.large_string(), nullable=False), record_id_array
pa.field(record_id_col_name, pa.large_binary(), nullable=False), record_id_array
)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — same change as above. add_system_tag_columns() now appends the record_id column as pa.binary(16) (both the field definition and the pa.array() call). The docstring was already correct; the implementation is now consistent with it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in commit e0bbc37 (same change as above — both the field and the array call now use pa.binary(16)).

Comment on lines +77 to +79
if t == "uuid":
return _pa.large_string()

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — reverted _pg_type_to_arrow("uuid") back to pa.binary(16), and updated _coerce_pg_value to extract .bytes from psycopg's uuid.UUID objects instead of calling str(). Unit test (test_uuid) and integration test (test_all_pg_types) updated accordingly.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in commit e0bbc37. _pg_type_to_arrow("uuid") now returns pa.binary(16).

Comment on lines +167 to +170
# psycopg returns uuid columns as uuid.UUID objects; large_string() needs str
if arrow_type == _pa.large_string() and hasattr(value, "hex") and not isinstance(value, str):
return str(value)
return value

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in commit e0bbc37. _coerce_pg_value() now checks arrow_type == pa.binary(16) and returns value.bytes (raw 16 bytes) instead of str(value).

Comment on lines 56 to 59
def test_uuid(self):
assert _pg_type_to_arrow("uuid", "uuid") == pa.large_string()
result = _pg_type_to_arrow("uuid", "uuid")
assert result == pa.large_string()

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in commit e0bbc37. test_uuid now asserts pa.binary(16).

Comment on lines 196 to 199
assert infos["g"].arrow_type == pa.large_string()
assert infos["h"].arrow_type == pa.large_binary()
assert infos["i"].arrow_type == pa.large_string() # TODO: PLT-1162
assert infos["i"].arrow_type == pa.large_string()
assert infos["j"].arrow_type == pa.large_string()

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in commit e0bbc37. The integration test now asserts infos["i"].arrow_type == pa.binary(16) for the UUID column.

Comment on lines +237 to +241
def __init__(self) -> None:
super().__init__("uuid")
self._python_type = _uuid_module.UUID
self._arrow_struct_type = pa.struct([pa.field("uuid", pa.binary(16))])

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No code change needed — updated the PR description instead. Adding module-level constants to types.py would trigger PyArrow import at types import time, defeating the LazyModule pattern already used there. Each call site keeping its own pa.binary(16) literal is clear and avoids that footgun. The PR description table row for types.py now reflects this.

Comment thread src/orcapod/core/datagrams/datagram.py Outdated
Comment on lines 429 to 442
@property
def datagram_id(self) -> str:
def datagram_id(self) -> uuid.UUID:
"""Return (or lazily generate) the datagram's unique ID."""
if self._datagram_id is None:
self._datagram_id = str(uuid7())
# uuid_utils is used here because it provides UUIDv7 (time-ordered,
# monotonic) generation, which is not available in the stdlib before
# Python 3.12. However, uuid_utils.UUID and stdlib uuid.UUID are
# distinct types that do not compare equal even for identical bit
# patterns. We therefore normalise to stdlib uuid.UUID immediately
# so that the public API always returns a consistent type and
# equality / hashing work correctly regardless of how a UUID was
# originally produced.
self._datagram_id = uuid.UUID(bytes=uuid7().bytes)
return self._datagram_id

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code is correct — datagram_id intentionally returns uuid.UUID (not bytes), as agreed in a prior review round. The PR description's Breaking Change note was wrong. Updated it to correctly state: datagram_iduuid.UUID; record_id/_log_id/_status_idbytes (16 bytes).

…op uuid module alias

- Remove `import uuid as _uuid_module` alias in data_function.py — use
  plain `import uuid` instead
- Change system-tag record_id column from pa.large_binary() to pa.binary(16)
  in arrow_utils.py: record IDs are always UUID v5 bytes (exactly 16 bytes),
  so a fixed-size type is more precise
- Revert PostgreSQL UUID → Arrow mapping from pa.large_string() back to
  pa.binary(16), and coerce uuid.UUID values via .bytes instead of str()
- Update unit and integration test assertions to match pa.binary(16)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@kurodo3

kurodo3 Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

Review round 3 — changes made

Three fixes in commit e0bbc37:

import uuid as _uuid_module alias removed (src/orcapod/core/data_function.py)
Plain import uuid now used throughout; call site updated to uuid.UUID(bytes=uuid7().bytes).

System-tag record_id column type changed to pa.binary(16) (src/orcapod/utils/arrow_utils.py)
add_system_tag_columns() now uses pa.binary(16) for both the field declaration and the pa.array() call. Record IDs in system tags are always UUID v5 bytes (exactly 16 bytes), so a fixed-size binary type is precise and prevents out-of-spec values.

PostgreSQL UUID mapping reverted to pa.binary(16) (src/orcapod/databases/postgresql_connector.py)
_pg_type_to_arrow("uuid") returns pa.binary(16) again. _coerce_pg_value extracts .bytes from psycopg's uuid.UUID objects (instead of calling str()). Unit test test_uuid and integration test test_all_pg_types updated to assert pa.binary(16).

All 3616 tests pass (excluding live-postgres integration tests, which require a running instance).

@kurodo3

kurodo3 Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

Copilot review round — changes made

Commit e0bbc37 addressed comments 1–6 (all were the same class of issue: pa.large_binary() / pa.large_string() where pa.binary(16) should be used):

  • arrow_utils.pyadd_system_tag_columns() record_id column changed from pa.large_binary() to pa.binary(16) (both the field declaration and the pa.array() call).
  • postgresql_connector.py_pg_type_to_arrow("uuid") reverted to pa.binary(16); _coerce_pg_value() updated to extract .bytes from psycopg uuid.UUID objects rather than calling str().
  • tests/test_databases/test_postgresql_connector.pytest_uuid now asserts pa.binary(16).
  • tests/test_databases/test_postgresql_connector_integration.pytest_all_pg_types UUID column assertion updated to pa.binary(16).

No code change for comments 7–8 — PR description corrections only:

  • Constants (UUID_ARROW_TYPE / UUID_STRUCT_ARROW_TYPE): Not added to types.py. Adding module-level constants would eagerly trigger PyArrow import at types import time, defeating the LazyModule pattern already used there. Each call site uses pa.binary(16) directly. PR description table updated to reflect this.
  • datagram_id type: The code is correct — datagram_id returns uuid.UUID (decided in a prior review round). The Breaking Change note in the PR description was wrong; updated to distinguish datagram_iduuid.UUID from record_id/_log_id/_status_idbytes.

The property returns a uuid.UUID, so the name datagram_uuid is more
descriptive. Renames the public property and protocol method, the
private backing field (_datagram_uuid), and all call sites.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 48 out of 48 changed files in this pull request and generated 7 comments.

Comment on lines +767 to +770
The Schema stores Python types. record_id system-tag columns are
stored as ``pa.large_binary()`` in Arrow, which maps to ``bytes`` in
the Python Schema. source_id columns are ``pa.large_string()`` →
``str``. After a two-way join both Python types must survive — no

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — docstring now says pa.binary(16) instead of pa.large_binary().

Comment thread src/orcapod/core/sources/stream_builder.py Outdated
Returns:
A deterministic ``uuid.UUID`` identifying this (source, row) pair.
"""
name = f"{source_id}::{provenance_token}"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — replaced f"{source_id}::{provenance_token}" with the length-prefixed form f"{len(source_id)}:{source_id}:{provenance_token}". Since the length of source_id is embedded, the parser can always locate the boundary exactly, even when either component contains : or ::. This is a pre-v0.1.0 hash change (intentional; no backward-compat obligation).

Comment on lines 784 to 787
@@ -783,7 +786,7 @@ def get_records_by_ids(
elif isinstance(record_ids, (pa.Array, pa.ChunkedArray)):
record_ids_list = cast(list[str], record_ids.to_pylist())

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — both the pl.Series and pa.Array/pa.ChunkedArray branches now apply coerce_record_id element-wise, matching the Collection branch. All three paths now produce a list[bytes].

Comment thread src/orcapod/databases/delta_lake_databases.py Outdated
Comment on lines +524 to +528
# The record_id segment is between the URI components and the key name
# Format: uri_part1:uri_part2:..::record_id::key
uuid_pattern = re.compile(
r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)
assert uuid_pattern.search(source_str), f"No UUID found in {source_str!r}"
# Format: uri_part1:uri_part2:..::record_id_hex::key
# record_id is stored as a 32-char hex string (UUID7 bytes, no dashes)
uuid_hex_pattern = re.compile(r"[0-9a-f]{32}")
assert uuid_hex_pattern.search(source_str), f"No UUID hex found in {source_str!r}"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — the test now splits source_str on "::" and validates parts[1] (the UUID hex segment) with re.fullmatch(r"[0-9a-f]{32}", ...) instead of searching the whole string. It also asserts that exactly 3 segments exist, making the format contract explicit.

Comment on lines 11 to 15
PostgreSQLConnector,
_arrow_type_to_pg_sql,
_coerce_pg_value,
_pg_type_to_arrow,
)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added TestCoercePgValue with 5 focused tests: None passthrough, uuid.UUID.bytes for pa.binary(16), uuid.UUID passed through unchanged for other Arrow types, raw bytes passthrough, and plain scalar passthrough. The import is now exercised by the test suite.

…D name, missing coercion, stale docs

- stream_builder: fix UUIDv5 name ambiguity — replace f"{source_id}::{token}"
  with length-prefixed f"{len(source_id)}:{source_id}:{token}" so pairs
  whose components contain "::" can never collide (pre-v0.1.0 intentional
  hash change)
- stream_builder: fix _make_record_id() docstring — pa.large_binary() →
  pa.binary(16)
- delta_lake_databases: fix get_records_by_ids() to coerce pl.Series and
  pa.Array inputs through coerce_record_id, matching the Collection branch
- delta_lake_databases: fix _create_record_ids_filter() type hint —
  list[str] → list[bytes]
- test_operators: fix docstring — pa.large_binary() → pa.binary(16)
- test_data_function: make UUID hex assertion precise — split on "::" and
  validate parts[1] exactly rather than searching anywhere in the string
- test_postgresql_connector: add TestCoercePgValue with 5 focused tests
  for _coerce_pg_value (None passthrough, UUID→bytes for binary(16),
  UUID passthrough for other types, bytes passthrough, scalar passthrough)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@kurodo3

kurodo3 Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

Copilot review round 2 — changes made (commit d0cbe5a)

UUIDv5 name ambiguity fixed (stream_builder.py)
_make_record_id() now uses the length-prefixed form f"{len(source_id)}:{source_id}:{provenance_token}" instead of f"{source_id}::{provenance_token}". A source_id whose length is embedded in the name gives an unambiguous parse boundary, so pairs that contain : or :: can no longer collide. Pre-v0.1.0 intentional hash change.

Missing coercion in get_records_by_ids() branches (delta_lake_databases.py)
The pl.Series and pa.Array/pa.ChunkedArray branches now apply coerce_record_id element-wise (same as the Collection branch), producing list[bytes] consistently across all input types.

_create_record_ids_filter() type hint (delta_lake_databases.py)
Parameter type updated from list[str] to list[bytes].

Stale docstrings (stream_builder.py, test_operators.py)
Both docstrings updated from pa.large_binary() to pa.binary(16).

More precise UUID hex assertion (test_data_function.py)
Test now splits on "::", validates exactly 3 segments, and uses re.fullmatch on parts[1] specifically — no longer susceptible to false positives from hex substrings in URI components.

_coerce_pg_value tests added (test_postgresql_connector.py)
New TestCoercePgValue class with 5 focused tests: None passthrough, uuid.UUID.bytes for pa.binary(16), uuid.UUID passed through unchanged for other types, raw bytes passthrough, and plain scalar passthrough.

The length-prefix encoding introduced in d0cbe5a was unnecessary.
In practice source_id is always a hex content hash (no "::" possible)
and provenance_token is "row_N" or "col=value" — the "::" separator
cannot appear in the hash path. A dedicated issue tracks formalising
this invariant with validation (see PLT-XXXX).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread src/orcapod/core/data_function.py Outdated

record_id = str(uuid7())
source_info = {k: combine(self.uri, (record_id,), (k,)) for k in output_data}
_uuid = uuid.UUID(bytes=uuid7().bytes)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's use more descriptive variable name to avoid possible confusion with the uuid module. Consider something like new_record_uuid

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to new_record_uuid throughout the function.

Comment thread src/orcapod/core/result_cache.py Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this supposed to be uuid or bytes that we pass into record_id field for Data?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was passing uuid.UUID — the intent was correct, the naming was misleading. Renamed the local variable to record_uuid and updated the keyword argument to Data(..., record_uuid=record_uuid).

Comment thread src/orcapod/core/datagrams/datagram.py Outdated
python_schema: SchemaLike | None = None,
meta_info: Mapping[str, DataValue] | None = None,
record_id: str | None = None,
record_id: uuid.UUID | None = None,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change this to record_uuid to clearly indicate it takes in uuid.UUID

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed record_idrecord_uuid in Datagram.__init__, _init_from_dict, _init_from_table, and propagated to Tag.__init__ and Data.__init__ as well.

Comment thread src/orcapod/core/result_cache.py Outdated
record_id = result_table.to_pylist()[0][RECORD_ID_COL]
record_id_bytes = result_table.to_pylist()[0][RECORD_ID_COL]
# Convert bytes back to uuid.UUID (stored as binary(16) in the DB)
record_id = uuid.UUID(bytes=bytes(record_id_bytes)) if record_id_bytes is not None else None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

explicitly name this record_uuid

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed — local variable is now record_uuid, passed as record_uuid=record_uuid to Data.

if value is None:
return None
# psycopg returns uuid columns as uuid.UUID objects; binary(16) needs raw bytes
if arrow_type == _pa.binary(16) and hasattr(value, "bytes") and not isinstance(value, (bytes, bytearray)):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this actually do isinstance check against uuid.UUID?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced with isinstance(value, uuid.UUID). Also added import uuid at module level so the check doesn't need a local import.

if raw is None:
raise ValueError("Missing 'uuid' field in struct dict")
content_hash = self._compute_content_hash(bytes(raw))
return self._format_hash_string(content_hash.digest, add_prefix=add_prefix)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear how this works -- isn't this supposed to be more like content_hash.to_string(add_prefix)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed _format_hash_string entirely — it was hardcoding sha256 in the format string independently of ContentHash.method, which would silently produce wrong output if the algorithm ever changes.

Replaced with _format_semantic_hash(content_hash, add_prefix) on SemanticStructConverterBase, which produces "{semantic_type_name}:{method}:{hex}" (e.g. "uuid:sha256:abc123") when add_prefix=True, and just the hex digest otherwise. The semantic type name prefix follows the pre-existing convention used by path converters.

Also fixed _compute_content_hash to use method="sha256" only — stuffing "{type}:sha256" into ContentHash.method was an abuse of that field's meaning. Filed PLT-1643 to evaluate whether the semantic-type prefix convention is correct at all, and whether the intermediate column type should be large_binary instead of large_string.

… isinstance, _format_hash_string

- rename _uuid → new_record_uuid in data_function.py
- rename local record_id → record_uuid in result_cache.py
- rename Datagram/Tag/Data __init__ parameter record_id → record_uuid
- postgresql_connector: replace duck-type hasattr check with isinstance(value, uuid.UUID)
- remove _format_hash_string from SemanticStructConverterBase (hardcoded sha256 independent
  of ContentHash.method); replace with _format_semantic_hash(content_hash, add_prefix) which
  delegates to ContentHash.to_string and prepends the semantic type name
- fix _compute_content_hash: method="sha256" only (semantic type name belongs in
  _format_semantic_hash, not stuffed into ContentHash.method)
- remove test_format_hash_string (tests deleted method)

Refs PLT-1643 for future decision on semantic-type prefix convention and large_string vs
large_binary in SemanticHashingVisitor.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@kurodo3

kurodo3 Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

Response to eywalker review (round 3)

All 6 items addressed in commit 62c8f88.

1. _uuidnew_record_uuid (data_function.py) — renamed throughout the function.

2 & 4. record_idrecord_uuid (result_cache.py) — local variable renamed to record_uuid; keyword arg updated to Data(..., record_uuid=record_uuid).

3. record_id parameter → record_uuid (datagram.py) — renamed in Datagram.__init__, _init_from_dict, _init_from_table, and propagated to Tag.__init__ and Data.__init__.

5. Duck-type → isinstance (postgresql_connector.py) — replaced hasattr(value, "bytes") and not isinstance(value, (bytes, bytearray)) with isinstance(value, uuid.UUID). Added import uuid at module level.

6. Remove _format_hash_string (semantic_struct_converters.py) — removed entirely; it was hardcoding sha256 independently of ContentHash.method. Also fixed _compute_content_hash to use method="sha256" only (not "{type}:sha256" — that was an abuse of the field). Replaced both call sites with _format_semantic_hash(content_hash, add_prefix), a new helper on SemanticStructConverterBase that produces "{semantic_type_name}:{method}:{hex}" or just the hex, keeping the pre-existing type:algorithm:hex convention intact.

During the investigation, two related open questions surfaced and were filed as PLT-1643: (a) whether the semantic-type prefix in hash strings is actually necessary for correctness, and (b) whether the intermediate pa.large_string() column in SemanticHashingVisitor should be pa.large_binary() instead. Both are deferred to that issue.

digest = hashlib.sha256(content).digest()
return ContentHash(method="sha256", digest=digest)

def _format_semantic_hash(self, content_hash: ContentHash, add_prefix: bool) -> str:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here add_prefix is being ignored -- if no intention of using it, drop it from the argument. Otherwise be sure to include it when generating the final string

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed entirely. add_prefix wasn't doing anything useful — _format_semantic_hash was hardcoding prefix_method=True regardless. hash_struct_dict now always returns the fully-prefixed form "{type}:sha256:<hex>" with no parameter.

content_hash = self._file_hasher.hash_file(path)
return self._format_hash_string(content_hash.digest, add_prefix=add_prefix)
file_hash = self._file_hasher.hash_file(path)
return self._format_semantic_hash(file_hash, add_prefix)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

honestly add_prefix here seems unnecessary?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed — dropped. hash_struct_dict now always returns the fully-prefixed "{type}:sha256:<hex>" form, making the parameter unnecessary.


content_hash = self._file_hasher.hash_file(path)
return self._format_hash_string(content_hash.digest, add_prefix=add_prefix)
file_hash = self._file_hasher.hash_file(path)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behavior of add_prefix = False is not correct here as it disagrees with the docstring. In case of add_prefix. = False, it should simply return file_hash.hex(False), right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed add_prefix entirely rather than trying to fix the inconsistent behavior. The visitor was the only caller and always passed add_prefix=True, so there was no value in keeping the parameter. hash_struct_dict now unconditionally returns "{type}:sha256:<hex>", and the visitor just calls hash_struct_dict(data).

hash_struct_dict always returns the fully-prefixed form
"{type}:sha256:<hex>", so the add_prefix parameter was redundant and
misleading. Remove it from the protocol, both converter implementations,
the visitor call site, and all tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@kurodo3

kurodo3 Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

Round 4 review — addressed

Removed add_prefix from hash_struct_dict entirely (commit 2fab1b8).

The add_prefix parameter was simultaneously being ignored (hardcoded prefix_method=True inside _format_semantic_hash) and redundant (the visitor was the only caller and always passed True). Rather than patching the inconsistency, removed the parameter across the board:

  • SemanticStructConverterBase._format_semantic_hash — no longer takes add_prefix; always returns "{type}:{method}:{hex}"
  • PathStructConverterBase.hash_struct_dict — signature simplified to (self, struct_dict)
  • UUIDStructConverter.hash_struct_dict — signature simplified to (self, struct_dict)
  • SemanticStructConverterProtocol.hash_struct_dict — protocol updated to match
  • SemanticHashingVisitor — call site changed from hash_struct_dict(data, add_prefix=True) to hash_struct_dict(data)
  • Tests updated throughout (including test_file_hashing_consistency.py which was comparing against raw hex and needed updating)

All 3620 tests pass.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 54 out of 54 changed files in this pull request and generated 4 comments.

Comment on lines +70 to +74
Returns:
A deterministic ``uuid.UUID`` identifying this (source, row) pair.
"""
name = f"{source_id}::{provenance_token}"
return uuid.uuid5(_SOURCE_RECORD_ID_NAMESPACE, name)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushing back on this one. The revert in b9a6934 was intentional: source_id is always a hex content hash (characters 0-9, a-f) and provenance_token is always "row_N" or "col=value" format — neither can contain ::. A future issue tracks adding explicit validation to enforce this invariant. No change here.

connector = SQLiteConnector(":memory:") # PLT-1076
db = ConnectorArrowDatabase(connector)
db.add_record(("results", "my_fn"), record_id="abc", record=table)
db.add_record(("results", "my_fn"), record_id=b"\\x00" * 16, record=table)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in commit 1886a4e. Changed to b"sha256:abc" to make clear record IDs are variable-length bytes (method-prefixed content hashes), not fixed-size UUIDs.

Comment on lines 238 to 242
self._pending_batches[record_key] = pa.concat_tables(
[existing_pending, records]
)
pending_ids = cast(list[str], records[self.RECORD_ID_COLUMN].to_pylist())
pending_ids = cast(list[bytes], records[self.RECORD_ID_COLUMN].to_pylist())
self._pending_record_ids[record_key].update(pending_ids)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same fix as above — InMemoryArrowDatabase.add_records() now raises TypeError early for non-binary record-id columns. See reply on the connector_arrow_database.py comment for full details.

Comment on lines 238 to +240
records = self._deduplicate_within_table(records)
record_key = self._get_record_key(record_path)
input_ids = set(cast(list[str], records[self.RECORD_ID_COLUMN].to_pylist()))
input_ids = set(cast(list[bytes], records[self.RECORD_ID_COLUMN].to_pylist()))

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in commit 1886a4e. Both ConnectorArrowDatabase.add_records() and InMemoryArrowDatabase.add_records() now raise TypeError immediately if the record-id column is not pa.binary() or pa.large_binary(). The cast(list[bytes], ...) annotation was hiding the silent failure — hard validation is consistent with the rest of the database layer's strict type contract.

Two internal callers that were building the hash column as pa.large_string() were also fixed: operator_node.py and cached_source.py now encode hash strings to bytes (h.encode()) so the stored column is pa.large_binary(). All test record IDs updated from str to bytes.

…ds()

String-typed record-id columns were silently accepted but caused
deduplication to always pass (set[str] & set[bytes] == empty set).
Both InMemoryArrowDatabase and ConnectorArrowDatabase now raise TypeError
immediately if the record-id column is not pa.binary() or pa.large_binary().

The two internal callers that were building the hash column as
pa.large_string() are fixed to encode the hash strings to bytes
(operator_node.py and cached_source.py). All tests updated to use
bytes record IDs throughout.

Also changes the docstring example in connector_arrow_database.py from
b"\x00" * 16 (UUID-sized) to b"sha256:abc" to make clear record IDs are
variable-length bytes, not fixed-size UUIDs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@kurodo3

kurodo3 Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

Copilot review round — addressed (commit 1886a4e)

Comment 1 — UUIDv5 name ambiguity (stream_builder.py): Declined. The :: separator is safe here: source_id is always a hex content hash (hex chars only, no ::) and provenance_token is always "row_N" or "col=value" format. This was intentionally reverted in b9a6934 after an earlier round over-fixed it. A future issue tracks formalising this with explicit input validation.

Comments 2 & 3 — String record-id column accepted silently (add_records() in both DB implementations): Fixed with hard type validation. Both InMemoryArrowDatabase and ConnectorArrowDatabase now raise TypeError at the top of add_records() if the record-id column is not pa.binary() or pa.large_binary(). The cast(list[bytes], ...) annotations were lying — they were type-hint-only and provided no runtime protection. This matches the strict type contract the rest of the database layer already enforces.

Two internal callers that were constructing the hash column as pa.large_string() were corrected: operator_node.py and cached_source.py now build the column as pa.large_binary() by encoding the hash hex strings via h.encode(). All database tests updated to use bytes record IDs.

Comment 4 — Docstring example looks UUID-specific: Fixed. Changed b"\x00" * 16 to b"sha256:abc" to make clear record IDs are variable-length bytes, not fixed-size UUIDs.

@eywalker eywalker merged commit f572dc9 into main Jun 13, 2026
11 checks passed
@eywalker eywalker deleted the eywalker/plt-1162-design-spike-uuid-arrow-type-mapping-large_string-vs branch June 13, 2026 00:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants